Lorem ipsum dolor sit amet, consectetur adipiscing elit. Mauris finibus leo et nulla maximus auctor. Nam at sapien porta, vulputate leo et, accumsan ex.
This analysis follows the six steps of the data analysis process: ask, prepare, process, analyze, share, and act.
The first step in the process is to ask the right questions in order to find the right solution to the problem. It includes defining business task and identify key stakeholders in the project.
The goal of this project is to identify the behavioural differences between the Cyclistic members and the Casual riders among the stations. This project will focus on the trip data according to the hour, day, month and day of the week for both of the customer groups of the year 2022. The final deliverable will recommend 3 suggestions to convert Casual riders into Cyclistic members.
Our business task is to understand how annual members and casual riders use Cyclistic bikes differently. It also aims to extract insights and develop the most appropriate marketing strategies that appeal to casual riders and encourage them to subscribe to annual memberships.
Three stakeholders include in the project:
The second step is to prepare and collect the data. It includes the following tasks:
Each historical bike trip dataset contains the following fields:
It is important to set up and import the data in the environment after downloading and storing CSV files in the desired location. RStudio and R programming will be utilized at this step.
First, it requires installing and loading the tidyverse
library. It is used for the following tasks:
readr,
tidyr, and dplyr.ggplot2.# Load necessary library
library(tidyverse)
After the environment has already been set up, the next step is to
import all 12-month data and then merge them into a single data frame,
called trip_data. The read_csv
and bind_rows functions are used to read and combine each
file.
# Import CSV files into 'trip_data' data frame
trip_data <- list.files(path = "./data/",
pattern = "*-divvy-tripdata.csv",
full.names = TRUE) %>%
lapply(read_csv) %>%
bind_rows %>%
arrange(started_at)
The data is ready to use for the next step. Before going to the next
process, let’s preview the data frame using glimpse() and
head() functions. From the below output, the data frame
contains 5,667,717 rows and 13 columns. It also provide the following
data type in each column.
ride_id,
rideable_type, start_station_name,
start_station_id, end_station_name,
end_station_id, and member_casual.started_at and
ended_at.start_lat,
start_lng, end_lat, and
end_lng.# Inspect data frame using 'glimpse' function
glimpse(trip_data)
## Rows: 5,667,717
## Columns: 13
## $ ride_id <chr> "98D355D9A9852BE9", "04706CA7F5BD25EE", "42178E850B…
## $ rideable_type <chr> "classic_bike", "electric_bike", "electric_bike", "…
## $ started_at <dttm> 2022-01-01 00:00:05, 2022-01-01 00:01:00, 2022-01-…
## $ ended_at <dttm> 2022-01-01 00:01:48, 2022-01-01 00:04:39, 2022-01-…
## $ start_station_name <chr> "Michigan Ave & 8th St", "Broadway & Waveland Ave",…
## $ start_station_id <chr> "623", "13325", "TA1305000009", "623", "623", "1343…
## $ end_station_name <chr> "Michigan Ave & 8th St", "Broadway & Barry Ave", "C…
## $ end_station_id <chr> "623", "13137", "TA1305000009", "623", "623", "1343…
## $ start_lat <dbl> 41.87277, 41.94907, 41.87592, 41.87277, 41.87277, 4…
## $ start_lng <dbl> -87.62398, -87.64863, -87.63119, -87.62398, -87.623…
## $ end_lat <dbl> 41.87277, 41.93758, 41.87593, 41.87277, 41.87277, 4…
## $ end_lng <dbl> -87.62398, -87.64410, -87.63058, -87.62398, -87.623…
## $ member_casual <chr> "casual", "casual", "casual", "casual", "casual", "…
# Inspect data frame using 'head' function
head(trip_data)
# Create 'trip_data_v2' data frame in order to use it for cleaning data
trip_data_v2 <- trip_data
# Check duplicate values of 'ride_id' in 'trip_data_v2' data frame
sum(duplicated(trip_data_v2$ride_id))
## [1] 0
rideable_type# Check unique type of bikes
bike_type <- count(trip_data_v2, rideable_type, name = "ride_count")
bike_type
# Change values from 'docked_bike' to 'classic_bike' and assign back to 'trip_data_v2' data frame
trip_data_v2 <- trip_data_v2 %>%
mutate(rideable_type = str_replace_all(rideable_type, "docked_bike", "classic_bike"))
# Recheck unique type of bikes
bike_type_v2 <- count(trip_data_v2, rideable_type, name = "ride_count")
bike_type_v2
member_casual# Check unique type of members
member_type <- count(trip_data_v2, member_casual, name = "ride_count")
member_type
start_station_name and
end_station_name# Recheck unique start stations and assign to 'start_station_v3' variable
start_station_v3 <- trip_data_v2 %>%
count(start_lat, start_lng, start_station_name, name = "station_count") %>%
arrange(start_lat, start_lng)
start_station_v3
# Recheck unique end stations and assign to 'end_station_v3' variable
end_station_v3 <- trip_data_v2 %>%
count(end_lat, end_lng, end_station_name, name = "station_count") %>%
arrange(end_lat, end_lng)
end_station_v3
# List all test stations and assign to 'test_station_list' variable
test_station_list <- c("Pawel Bialowas - Test- PBSC charging station",
"Hastings WH 2",
"DIVVY CASSETTE REPAIR MOBILE STATION",
"Base - 2132 W Hubbard Warehouse",
"Base - 2132 W Hubbard",
"NewHastings",
"WestChi",
"WEST CHI-WATSON")
# Remove test stations and assign back to 'trip_data_v2' data frame
trip_data_v2 <- trip_data_v2 %>%
filter(!(trip_data_v2$start_station_name %in% test_station_list |
trip_data_v2$end_station_name %in% test_station_list))
# Check number of rows and columns
dim(trip_data_v2)
## [1] 5665349 13
# List all inconsistent words and assign to 'test_station_list' variable
words <- c("*", " - Charging", " (Temp)", "amp;", "Public Rack - ")
# " - midblock",
# " - North", " - N", " N", "- north corner",
# " - South", " - S", " S","- south corner", " - midblock south",
# " - East", " - E", " E",
# " - West", " - W", " W",
# " - NE", " - NW", " - SE", " - SW",
# " NE", " NW", " SE", " SW"
# Repeat word in the 'words' list
for (word in words) {
# Change specific texts to make them all aligned and assign back to 'trip_data_v2' data frame
trip_data_v2 <- trip_data_v2 %>%
mutate(start_station_name = str_replace_all(start_station_name, fixed(word), "")) %>%
mutate(end_station_name = str_replace_all(end_station_name, fixed(word), ""))
}
# Recheck unique start stations and assign to 'start_station_v4' variable
start_station_v4 <- trip_data_v2 %>%
count(start_lat, start_lng, start_station_name, name = "station_count") %>%
arrange(start_lat, start_lng)
start_station_v4
# Recheck unique end stations and assign to 'end_station_v4' variable
end_station_v4 <- trip_data_v2 %>%
count(end_lat, end_lng, end_station_name, name = "station_count") %>%
arrange(end_lat, end_lng)
end_station_v4
# Check missing values each column
na_col_count <- data.frame(colSums(is.na(trip_data_v2)))
colnames(na_col_count)[1] <- "na_count"
na_col_count
# Check unique start stations and assign to 'start_station' variable
start_station <- trip_data_v2 %>%
count(start_lat, start_lng, start_station_name, name = "station_count") %>%
arrange(start_lat, start_lng)
start_station
# Check unique end stations and assign to 'end_station' variable
end_station <- trip_data_v2 %>%
count(end_lat, end_lng, end_station_name, name = "station_count") %>%
arrange(end_lat, end_lng)
end_station
# Impute missing start/end station names and IDs by using geographic coordinate as reference
digit <- 5 # Set digits to start with 5
# Repeat rounding digits from 5 to 2
while (digit > 1) {
# Add four columns to round digits
trip_data_v2 <- trip_data_v2 %>%
mutate(start_lat_round = round(start_lat, digits = digit),
start_lng_round = round(start_lng, digits = digit),
end_lat_round = round(end_lat, digits = digit),
end_lng_round = round(end_lng, digits = digit))
# Fill in missing start station names & IDs, referencing with 'start_lat_round' and 'start_lng_round'
trip_data_v2 <- trip_data_v2 %>%
group_by(start_lat_round, start_lng_round) %>%
fill(start_station_name, .direction = "downup") %>%
fill(start_station_id, .direction = "downup") %>%
ungroup()
# Fill in missing end station names & IDs, referencing with 'end_lat_round' and 'end_lng_round'
trip_data_v2 <- trip_data_v2 %>%
group_by(end_lat_round, end_lng_round) %>%
fill(end_station_name, .direction = "downup") %>%
fill(end_station_id, .direction = "downup") %>%
ungroup()
# Decrement digits by 1
digit <- digit - 1
}
# Check missing values each column
na_col_count_v2 <- data.frame(colSums(is.na(trip_data_v2)))
colnames(na_col_count_v2)[1] <- "na_count"
na_col_count_v2
# Recheck unique start stations and assign to 'start_station_v2' variable
start_station_v2 <- trip_data_v2 %>%
count(start_lat, start_lng, start_station_name, name = "station_count") %>%
arrange(start_lat, start_lng)
start_station_v2
# Recheck unique end stations and assign to 'end_station_v2' variable
end_station_v2 <- trip_data_v2 %>%
count(end_lat, end_lng, end_station_name, name = "station_count") %>%
arrange(end_lat, end_lng)
end_station_v2
# Remove rows containing missing values and assign back to 'trip_data_v2' data frame
trip_data_v2 <- drop_na(trip_data_v2)
# Check number of rows and columns
dim(trip_data_v2)
## [1] 5614669 17
# Add three columns into data frame: 'ride_length_min', 'day_of_week', and 'month'
trip_data_v2$ride_length_min <- as.double(difftime(trip_data_v2$ended_at,
trip_data_v2$started_at,
units = "mins"))
trip_data_v2$day_of_week <- wday(trip_data_v2$started_at, label = TRUE)
trip_data_v2$month <- format(trip_data_v2$started_at, "%b")
# Order levels of 'day_of_week' from Monday to Sunday
trip_data_v2$day_of_week <- ordered(trip_data_v2$day_of_week,
levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))
# Order levels of 'month' from January to December
trip_data_v2$month <- ordered(trip_data_v2$month,
levels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
# Inspect data frame using 'head' function
head(trip_data_v2)
# Drop irrelevant columns and assign back to 'trip_data_v2' data frame
trip_data_v2 <- trip_data_v2 %>%
select(!c(start_station_id, end_station_id,
start_lat_round, start_lng_round,
end_lat_round, end_lng_round))
# Check number of rows and columns
dim(trip_data_v2)
## [1] 5614669 14
# Inspect data frame using 'head' function
head(trip_data_v2)
trip_data_v2 <- trip_data_v2 %>%
filter( !(trip_data_v2$start_lat == 0 |
trip_data_v2$start_lng == 0 |
trip_data_v2$end_lat == 0 |
trip_data_v2$end_lng == 0) )
# Check number of rows and columns
dim(trip_data_v2)
## [1] 5614661 14
summary(trip_data_v2$ride_length_min)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -10353.35 5.80 10.25 16.29 18.38 32035.45
trip_data_v2 <- trip_data_v2 %>%
filter(!(trip_data_v2$ride_length_min < 1 | trip_data_v2$ride_length_min > 1440))
# Check number of rows and columns
dim(trip_data_v2)
## [1] 5494500 14
summary(trip_data_v2$ride_length_min)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 6.05 10.48 16.55 18.67 1439.37
ggplot(data = trip_data_v2, aes(x = member_casual, y = ride_length_min, fill = member_casual)) +
geom_boxplot() +
coord_flip() +
theme(legend.position="none") +
labs(x = "Member type",
y = "Ride length (in minutes)",
title = "Box plot showing 'ride_length_min' before removing outliers")
median_value <- median(trip_data_v2$ride_length_min)
q1 <- as.numeric(quantile(trip_data_v2$ride_length_min, probs = 0.25))
q3 <- as.numeric(quantile(trip_data_v2$ride_length_min, probs = 0.75))
iqr_value <- IQR(trip_data_v2$ride_length_min)
lower_limit <- q1 - ( 1.5 * iqr_value )
upper_limit <- q3 + ( 1.5 * iqr_value )
trip_data_v2 <- trip_data_v2 %>%
filter(!(trip_data_v2$ride_length_min < lower_limit | trip_data_v2$ride_length_min > upper_limit))
summary(trip_data_v2$ride_length_min)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 5.783 9.717 11.958 16.150 37.583
ggplot(data = trip_data_v2, aes(x = member_casual, y = ride_length_min, fill = member_casual)) +
geom_boxplot() +
coord_flip() +
theme(legend.position="none") +
labs(x = "Member type",
y = "Ride length (in minutes)",
title = "Box plot showing 'ride_length_min' after removing outliers")
# Inspect data frame using 'glimpse' function
glimpse(trip_data_v2)
## Rows: 5,091,142
## Columns: 14
## $ ride_id <chr> "98D355D9A9852BE9", "04706CA7F5BD25EE", "42178E850B…
## $ rideable_type <chr> "classic_bike", "electric_bike", "electric_bike", "…
## $ started_at <dttm> 2022-01-01 00:00:05, 2022-01-01 00:01:00, 2022-01-…
## $ ended_at <dttm> 2022-01-01 00:01:48, 2022-01-01 00:04:39, 2022-01-…
## $ start_station_name <chr> "Michigan Ave & 8th St", "Broadway & Waveland Ave",…
## $ end_station_name <chr> "Michigan Ave & 8th St", "Broadway & Barry Ave", "C…
## $ start_lat <dbl> 41.87277, 41.94907, 41.87592, 41.87277, 41.87277, 4…
## $ start_lng <dbl> -87.62398, -87.64863, -87.63119, -87.62398, -87.623…
## $ end_lat <dbl> 41.87277, 41.93758, 41.87593, 41.87277, 41.87277, 4…
## $ end_lng <dbl> -87.62398, -87.64410, -87.63058, -87.62398, -87.623…
## $ member_casual <chr> "casual", "casual", "casual", "casual", "casual", "…
## $ ride_length_min <dbl> 1.716667, 3.650000, 30.966667, 28.883333, 28.483333…
## $ day_of_week <ord> Sat, Sat, Sat, Sat, Sat, Sat, Sat, Sat, Sat, Sat, S…
## $ month <ord> Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, Jan, J…
#
summary_stats <- summarise(trip_data_v2,
sd = sd(ride_length_min),
mean = mean(ride_length_min),
count = n())
#
calculate_percentage <- function(n_sd) {
filtered_count <- trip_data_v2 %>%
filter(between(ride_length_min, summary_stats$mean - n_sd * summary_stats$sd, summary_stats$mean + n_sd * summary_stats$sd)) %>%
summarise(count = n())
round((filtered_count$count / summary_stats$count) * 100, 2)
}
#
percentage_sd1 <- calculate_percentage(1)
percentage_sd2 <- calculate_percentage(2)
percentage_sd3 <- calculate_percentage(3)
paste0("One standard deviation: ", format(percentage_sd1, nsmall = 2), "%")
## [1] "One standard deviation: 72.10%"
paste0("Two standard deviations: ", percentage_sd2, "%")
## [1] "Two standard deviations: 94.08%"
paste0("Three standard deviations: ", percentage_sd3, "%")
## [1] "Three standard deviations: 99.38%"